Group 1 Final
1 Introduction
Heart disease is one of the most common diseases and a leading cause of death in the United States. This dataset draws on CDC data from 2020 for people with and without heart disease. It includes health-related variables such as BMI, smoking status, amount of physical activity, age, and race. By studying how these variables relate to instances of heart disease, we hope to determine whether any factors significantly predict, or are correlated with, heart disease. The dataset also includes a measure of mental health, and we were interested in which factors can affect it. For instance, we predicted that drinking, smoking, and physical activity would have some impact on overall mental health. Lastly, we want to look at the relation between BMI and physical health. Some recent studies suggest that BMI is not correlated with physical health, so we would like to use this dataset to explore that relation.
1.1 Background
Healthy habits, as the term is used here, cover several factors found in this database, such as eating a plant-based diet, average sleep time, mental health, and physical activity. Mental health also tends to be related to physical health, and this holistic relationship appears to have repercussions for diseases as serious as heart disease. The relationships among the variables in this database help us understand the choices made by the people interviewed and estimate which habits lead to degraded mental health, to physical health at risk, and even to heart disease.
1.2 Description of the Dataset
This data comes from a 2020 CDC survey on health status, used to study overall health and potential contributors to heart disease. The original dataset had 279 variables and over 400,000 rows, but the version uploaded to Kaggle contains 18 variables that could potentially influence heart disease and just under 320,000 complete rows (319,795), so there are no NAs, and all 18 variables were taken into account in some way. The 18 variables are the following: HeartDisease, BMI, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, SkinCancer. Most of these are straightforward variables that correspond to their names, but a few require further explanation.
The people interviewed for this survey answered “Yes” for Smoking if they have smoked at least 100 cigarettes in their entire lifetime and “Yes” for AlcoholDrinking if they are considered heavy drinkers (more than 14 drinks per week for men and 7 for women). PhysicalHealth and MentalHealth are numerical variables that give the number of days in the past 30 days during which their physical or mental health, respectively, could be considered not good. Lower values therefore correspond to fewer days of poor health. Respondents answered “Yes” to DiffWalking if they have any difficulty walking or climbing stairs. Diabetic is a four-level factor variable that records whether they have ever had diabetes, with the following responses: “No”, “Borderline”, “Yes (during pregnancy)”, or “Yes”. PhysicalActivity records whether the respondents reported any physical activity in the past 30 days outside of their regular job. The rest of the variables should be relatively self-explanatory given their names.
2 Understanding the Data
2.1 Dataset Summary
Importing the dataset and original data structure:
## 'data.frame': 319795 obs. of 18 variables:
## $ HeartDisease : chr "No" "No" "No" "No" ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : chr "Yes" "No" "Yes" "No" ...
## $ AlcoholDrinking : chr "No" "No" "No" "No" ...
## $ Stroke : chr "No" "Yes" "No" "No" ...
## $ PhysicalHealth : num 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : chr "No" "No" "No" "No" ...
## $ Sex : chr "Female" "Female" "Male" "Female" ...
## $ AgeCategory : chr "55-59" "80 or older" "65-69" "75-79" ...
## $ Race : chr "White" "White" "White" "White" ...
## $ Diabetic : chr "Yes" "No" "Yes" "No" ...
## $ PhysicalActivity: chr "Yes" "Yes" "Yes" "No" ...
## $ GenHealth : chr "Very good" "Very good" "Fair" "Good" ...
## $ SleepTime : num 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : chr "Yes" "No" "Yes" "No" ...
## $ KidneyDisease : chr "No" "No" "No" "No" ...
## $ SkinCancer : chr "Yes" "No" "No" "Yes" ...
2.2 Cleaning the Dataset
* All but five of the variables were set to factors. Most factor variables had 2 levels (yes or no questions), but some had up to six levels.
* In the variable Race, “American Indian/Alaskan Native” was redefined to “Native” in order to conserve space on plots and tables, but it should be noted that these two groups make up that level.
* In the variable Diabetic, “No, borderline diabetes” was redefined to “Borderline” in order to conserve space. There is no information lost in doing this.
* An order for the factor variables Race, Diabetic, and GenHealth was established to keep the orders uniform across plots and tables. The order for Race is based on relative frequency (with “Other” at the end), and the other two variables were put in a logical order.
* The variable AgeCategory was replaced with Age so that it could be used as a numerical variable. A random value was chosen in the range given by AgeCategory and assigned to Age, and unnecessary variables were deleted.
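The AgeCategory conversion described in the last bullet could look roughly like the sketch below. It is an illustration, not the code used for this report: the helper name `age_from_category` is hypothetical, and the upper bound of 84 for “80 or older” is an assumption.

```r
# Hypothetical helper: draw one random integer age inside an AgeCategory label.
# The upper bound used for "80 or older" (84 here) is an assumption.
age_from_category <- function(category, open_upper = 84) {
  bounds <- regmatches(category, gregexpr("[0-9]+", category))[[1]]
  lower <- as.integer(bounds[1])
  upper <- if (length(bounds) >= 2) as.integer(bounds[2]) else open_upper
  # sample.int avoids the sample(x, 1) pitfall when lower == upper
  lower + sample.int(upper - lower + 1L, 1L) - 1L
}

set.seed(1)
categories <- c("55-59", "80 or older", "65-69", "75-79")
ages <- vapply(categories, age_from_category, integer(1), USE.NAMES = FALSE)
ages
```

Because the value is random within each range, Age carries no more information than AgeCategory; it only makes the variable usable in numeric summaries and models.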
## 'data.frame': 319795 obs. of 18 variables:
## $ HeartDisease : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 2 1 1 ...
## $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Stroke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ PhysicalHealth : num 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 1 2 1 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 1 1 1 2 ...
## $ Race : Factor w/ 6 levels "White","Hispanic",..: 1 1 1 1 1 3 1 1 1 1 ...
## $ Diabetic : Factor w/ 4 levels "No","Borderline",..: 4 1 4 1 1 1 1 4 2 1 ...
## $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 2 ...
## $ GenHealth : Factor w/ 5 levels "Poor","Fair",..: 4 4 2 3 4 2 2 3 2 3 ...
## $ SleepTime : num 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 1 1 ...
## $ KidneyDisease : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ SkinCancer : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 1 1 ...
## $ Age : num 59 81 66 76 41 75 71 84 83 65 ...
3 Exploratory Data Analysis
3.1 Understanding the Data
These pie charts give the relative frequency of a few key factor variables. They show that a majority of people do not have heart disease, have not smoked, are not heavy drinkers, are white, and are female.
These histograms give a brief look at the numerical variables. BMI has an average value of 28.3 with a right skew. A majority of people reported 0 days of poor physical and mental health over the past 30 days. The average for sleep time and age is 7.1 hours and 54.6 years, respectively.
3.2 Smart Question: What variables affect instances of heart disease?
| HeartDisease | Count |
|---|---|
| No | 292422 |
| Yes | 27373 |

| HeartDisease | No | Yes |
|---|---|---|
| No | 176551 | 115871 |
| Yes | 11336 | 16037 |

| HeartDisease | No | Yes |
|---|---|---|
| No | 271786 | 20636 |
| Yes | 26232 | 1141 |

| HeartDisease | No | Borderline | Yes (during pregnancy) | Yes |
|---|---|---|---|---|
| No | 252134 | 5992 | 2451 | 31845 |
| Yes | 17519 | 789 | 108 | 8957 |

| HeartDisease | No | Yes |
|---|---|---|
| No | 284742 | 7680 |
| Yes | 22984 | 4389 |

| HeartDisease | No | Yes |
|---|---|---|
| No | 258040 | 34382 |
| Yes | 17345 | 10028 |

| HeartDisease | No | Yes |
|---|---|---|
| No | 61954 | 230468 |
| Yes | 9884 | 17489 |

| HeartDisease | No | Yes |
|---|---|---|
| No | 254483 | 37939 |
| Yes | 22440 | 4933 |

| HeartDisease | No | Yes |
|---|---|---|
| No | 284098 | 8324 |
| Yes | 23918 | 3455 |

| HeartDisease | No | Yes |
|---|---|---|
| No | 267583 | 24839 |
| Yes | 22393 | 4980 |
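These tables compare factor variables with instances of heart disease (rows) and should be read as proportions rather than raw counts. To summarize briefly, heart disease was proportionally more common among people who had smoked, had a stroke, had difficulty walking, were not physically active, had asthma, had kidney disease, had diabetes, or had skin cancer. Heavy drinking appears to be the exception: reading the third table as the AlcoholDrinking cross-tabulation, about 5.2% of heavy drinkers (1,141 of 21,777) reported heart disease compared with about 8.8% of other respondents, which agrees with the negative AlcoholDrinking coefficient in Section 5.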
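As a rough illustration of how such a comparison can be read off a contingency table, the sketch below converts the counts of the second table above into column-wise proportions. The object names are illustrative, and the column variable is left generic because the tables carry no captions.

```r
# Counts copied from the second table above: rows are HeartDisease (No/Yes),
# columns are the two levels of one yes/no factor.
counts <- matrix(c(176551, 115871,
                    11336,  16037),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(HeartDisease = c("No", "Yes"),
                                 Factor = c("No", "Yes")))

# Share of each factor level that reports heart disease (each column sums to 1)
col_share <- prop.table(counts, margin = 2)
round(col_share["Yes", ], 3)
```

Comparing the two entries of the “Yes” row shows whether heart disease is proportionally more common in one level of the factor than in the other, which raw counts alone can hide when the groups differ in size.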
This boxplot looks at the distribution of BMI values for people with and without heart disease. The median BMI was 27.3 for people without heart disease and 28.3 for people with heart disease.
These two plots look at instances of heart disease compared to both age and sex. The median age of people with heart disease was 70 years versus 55 years for people without. The area plot demonstrates an increase in the percentage of people with heart disease as their age increases and also shows that men are more likely to have heart disease than women.
These two bar graphs and split violin plot look at general health for people with and without heart disease. The most common response was “Good” health for people with heart disease and “Very good” for people without heart disease. The violin plot confirms the relationship between heart disease and age while showing the distribution of people in each health category versus age in each case.
This split violin plot and accompanying bar graph look at race and age versus heart disease. The violin plot shows the unexpected result that non-white people report having heart disease at younger ages than white people. The bar graph shows that white people have one of the highest percentages of heart disease (9.2%) and Asian people the lowest (3.3%). By the counts in Section 4.1.2, the American Indian/Alaskan Native group has a slightly higher rate (542 of 5,202, about 10.4%) than white respondents, and the remaining groups fall between 3.3% and 9.2%.
3.3 Smart Question: What variables affect mental health?
These four boxplots look at which variables affect mental health. Since over 60% of all respondents reported 0 days of poor mental health, those instances were omitted, so these plots cover only people who reported at least 1 day of poor mental health. They show that people have fewer poor mental health days when they don’t smoke, aren’t heavy drinkers, are physically active, and get the recommended amount of sleep (7 to 9 hours per night).
3.4 Smart Question: Does BMI have any effect on physical health?
This boxplot compares general health categories to recorded BMI. Those who reported “Excellent” health had the lowest median BMI (25.4) and those who reported “Fair” health had the highest median BMI (29.4).
4 Testing
4.1 Chi-Square Test
The Chi-square test of independence is a statistical hypothesis test used to determine whether two categorical variables are likely to be related. We conduct a couple of chi-square tests to check whether pairs of variables are independent.
4.1.2 Does the data support that race very much affects heart disease?
We want to check if the Heart Disease variable is related to the Race variable. We conduct another chi-square test to check whether Heart Disease and Race are independent. H0: Heart Disease and race are independent from each other. H1: Heart Disease and race are not independent from each other.
## HeartDisease
## Race No Yes
## American Indian/Alaskan Native 4660 542
## Asian 7802 266
## Black 21210 1729
## Hispanic 26003 1443
## Other 10042 886
## White 222705 22507
##
## Pearson's Chi-squared test
##
## data: racetable
## X-squared = 844, df = 5, p-value <0.0000000000000002
The table shows how many people of each race do and do not have heart disease. By these counts, heart disease is most common among American Indian/Alaskan Native respondents (542 of 5,202, about 10.4%) and white respondents (22,507 of 245,212, about 9.2%), and least common among Asian respondents (266 of 8,068, about 3.3%). The chi-square test checks whether the two variables are related. With a p-value below 2.2e-16, we reject the null hypothesis, which supports the conclusion that race and heart disease are associated. The test alone does not show that race causes heart disease.
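The test above can be reproduced from the printed counts alone. The following is a minimal sketch using base R's `chisq.test()`; the object names `race_counts` and `rate` are illustrative and not taken from the report's code.

```r
# Counts copied from the race table above (rows: Race, columns: HeartDisease)
race_counts <- matrix(c(  4660,   542,
                          7802,   266,
                         21210,  1729,
                         26003,  1443,
                         10042,   886,
                        222705, 22507),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(
                        Race = c("American Indian/Alaskan Native", "Asian",
                                 "Black", "Hispanic", "Other", "White"),
                        HeartDisease = c("No", "Yes")))

# Pearson chi-square test of independence
race_test <- chisq.test(race_counts)
race_test

# Share of each race reporting heart disease, sorted from highest to lowest
rate <- race_counts[, "Yes"] / rowSums(race_counts)
round(sort(rate, decreasing = TRUE), 3)
```

The per-race rates are worth printing next to the test, because the chi-square statistic only says that the rates differ, not which groups are higher or lower.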
4.1.3 Does the data support that Heart Disease has an effect on Gen Health?
The last chi-square test checks whether heart disease is independent of general health. H0: Heart Disease and Gen Health are independent of each other. H1: Heart Disease and Gen Health are not independent of each other.

| HeartDisease | Excellent | Fair | Good | Poor | Very good |
|---|---|---|---|---|---|
| No | 65342 | 27593 | 83571 | 7439 | 108477 |
| Yes | 1500 | 7084 | 9558 | 3850 | 5381 |
##
## Pearson's Chi-squared test
##
## data: gentable
## X-squared = 21542, df = 4, p-value <0.0000000000000002
The contingency table shows the number of people with and without heart disease in each of the five general health categories. The chi-square test checks whether the variables are related. With a p-value below 2.2e-16, we reject the null hypothesis, which supports the conclusion that heart disease and general health are associated.
4.2 T-test
4.2.1 What is the average age people are suffering from heart disease?
In our dataset, HeartDisease is a factor variable and Age is a numeric variable. To estimate the average age of people with heart disease, we used a one-sample t-test, which compares the mean of the sample data to a known value. The test also reports the sample mean and a 95% confidence interval, which are the values we use to describe the average age of people with heart disease.
##
## One Sample t-test
##
## data: heart_disease_on$Age
## t = 929, df = 27372, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 68.0 68.3
## sample estimates:
## mean of x
## 68.2
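After subsetting the dataset to the rows where HeartDisease is “Yes”, we ran the t-test to find the average age. The result shows that the mean age of people with heart disease is 68.2 years, with a 95% confidence interval of 68.0 to 68.3. Because the test uses the default null hypothesis of a mean of 0, the very small p-value only confirms that the mean age is not zero. The estimate and the confidence interval are the informative results.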
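A minimal sketch of this step is shown below. It uses synthetic ages rather than the survey data, and the reference value of 55 in the second test is a hypothetical choice made only to show a null hypothesis other than 0.

```r
set.seed(42)
# Synthetic stand-in for the ages of respondents with heart disease (not the real data)
age_hd <- round(rnorm(500, mean = 68, sd = 12))

# Default one-sample t-test: null hypothesis is a true mean of 0, as in the output above
res_default <- t.test(age_hd, mu = 0)
res_default$estimate   # sample mean
res_default$conf.int   # 95% confidence interval for the mean

# Same data tested against a hypothetical reference age of 55
res_ref <- t.test(age_hd, mu = 55)
res_ref$p.value
```

The estimate and confidence interval are identical in both calls; only the test statistic and p-value depend on the value passed as `mu`.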
4.3 Test For Association
A correlation test evaluates the association between two variables. Pearson’s correlation coefficient r ranges from −1 to 1: a value of −1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship.
4.3.1 What variables affect mental health physical health? In particular, does alcohol drinking, smoking
To know the effect of smoking and drinking on mental and physical health Pearson’s method of cor test is used.
##
## Pearson's product-moment correlation
##
## data: heartdata$MentalHealth and as.numeric(heartdata$Smoking)
## t = 48, df = 319793, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0817 0.0886
## sample estimates:
## cor
## 0.0852
##
## Pearson's product-moment correlation
##
## data: heartdata$MentalHealth and as.numeric(heartdata$AlcoholDrinking)
## t = 29, df = 319793, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0478 0.0547
## sample estimates:
## cor
## 0.0513
Neither smoking nor drinking is strongly correlated with mental health. The correlation is 0.0852 for smoking and 0.0513 for alcohol drinking. Both are statistically significant because of the large sample, but both are weak. Of the two, smoking has the stronger correlation with mental health.
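The outputs above pass a two-level factor through `as.numeric()`, which codes the levels as 1 and 2. The sketch below, on synthetic data with illustrative variable names, shows that this gives the same Pearson correlation as an explicit 0/1 coding, since correlation is unaffected by shifting a variable.

```r
set.seed(7)
n <- 1000
# Synthetic stand-ins (not the survey data): a yes/no factor and a 0-30 day count
smoking <- factor(sample(c("No", "Yes"), n, replace = TRUE), levels = c("No", "Yes"))
mental  <- pmin(30, rpois(n, lambda = ifelse(smoking == "Yes", 5, 3)))

# as.numeric() on a factor returns the level codes 1 ("No") and 2 ("Yes")
res <- cor.test(mental, as.numeric(smoking))
res$estimate

# The same correlation with an explicit 0/1 indicator
r01 <- cor(mental, as.integer(smoking == "Yes"))
r01
```

With one binary variable, this Pearson correlation is the point-biserial correlation, so its sign simply indicates which group has the higher mean.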
##
## Pearson's product-moment correlation
##
## data: heartdata$PhysicalHealth and as.numeric(heartdata$Smoking)
## t = 66, df = 319793, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.112 0.119
## sample estimates:
## cor
## 0.115
##
## Pearson's product-moment correlation
##
## data: heartdata$PhysicalHealth and as.numeric(heartdata$AlcoholDrinking)
## t = -10, df = 319793, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0207 -0.0138
## sample estimates:
## cor
## -0.0173
Neither smoking nor drinking is strongly correlated with physical health. The correlation is 0.115 for smoking and −0.0173 for alcohol drinking. Smoking therefore has the stronger correlation with physical health, while the correlation for alcohol drinking is slightly negative and close to zero.
5 Model building
5.1 SMART Question
What variables affect instances of heart disease?
Our goal is to find out the people who are likely to have heart disease in the future, so we can take some actions like a more detailed physical examination before the conditions become worse.
5.2 Pre-processing and balancing the data
The first step is to perform some pre-processing work.
First, because we will use bestglm::bestglm(), a feature selection method, to decide which variables are essential and which are not, we must clean the dataset with the target variable renamed y and all other unused variables removed from the dataset. Thus, we put the HeartDisease column at the end of the dataset and renamed it as y.
Second, considering that there are few rows with the value “Yes (during pregnancy)” in the Diabetic variable, we combine the value “Yes (during pregnancy)” and “Yes” together in the Diabetic variable.
These are the two pre-processing steps in the model building part.
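The two steps can be sketched in base R as follows. The toy data frame and the name `model_data` are illustrative, and the report's own code may differ.

```r
# Toy rows standing in for the cleaned data (values are illustrative)
heartdata <- data.frame(
  HeartDisease = factor(c("No", "Yes", "No", "No"), levels = c("No", "Yes")),
  BMI = c(16.6, 20.3, 26.6, 24.2),
  Diabetic = factor(c("Yes", "No", "Yes (during pregnancy)", "No, borderline diabetes"))
)

# Step 1: move HeartDisease to the last column and rename it y
other_cols <- setdiff(names(heartdata), "HeartDisease")
model_data <- heartdata[, c(other_cols, "HeartDisease")]
names(model_data)[ncol(model_data)] <- "y"

# Step 2: fold "Yes (during pregnancy)" into "Yes" by renaming the level
lv <- levels(model_data$Diabetic)
levels(model_data$Diabetic)[lv == "Yes (during pregnancy)"] <- "Yes"

str(model_data)
```

Assigning an existing level name merges the two levels, so Diabetic ends up with three levels, which matches the structure printed later in this section.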
After preprocessing, we need to balance the data. Let us look at the proportion of heart disease data before we continue our research.
##
## No Yes
## 292422 27373
The dataset is very unbalanced: only 8.6% of the rows have the value “Yes” for y (HeartDisease). Because the dataset is large, we use undersampling to balance it. After balancing, the “No” and “Yes” values of y occur equally often. We used the following reference for different balancing methods: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/
The data is now balanced.
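Random undersampling keeps every row of the minority class and an equally sized random sample of the majority class. The sketch below shows the idea in base R on a toy data frame. It is an assumption about the general technique, not the specific function used for this report.

```r
set.seed(123)
# Toy unbalanced data: 900 "No" rows and 100 "Yes" rows
toy <- data.frame(
  x = rnorm(1000),
  y = factor(c(rep("No", 900), rep("Yes", 100)), levels = c("No", "Yes"))
)

# Size of the smaller class
minority_n <- min(table(toy$y))

# For each class, keep a random sample of minority_n row indices
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$y), function(idx) {
  idx[sample.int(length(idx), minority_n)]
}))

balanced <- toy[keep, ]
table(balanced$y)
```

Undersampling discards most majority-class rows, which is affordable here because the minority class alone still has more than 27,000 observations.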
##
## No Yes
## 27373 27373
Before we begin the logistic regression model, let us look at the structure of the dataset now.
## 'data.frame': 54746 obs. of 18 variables:
## $ BMI : num 30.9 21.9 25.8 25.8 30.1 ...
## $ Smoking : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
## $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ Stroke : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ PhysicalHealth : num 0 0 1 5 1 0 0 0 3 0 ...
## $ MentalHealth : num 0 0 0 0 0 0 0 3 10 0 ...
## $ DiffWalking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 1 2 1 1 1 ...
## $ Race : Factor w/ 6 levels "American Indian/Alaskan Native",..: 6 6 6 4 6 6 6 6 6 6 ...
## $ Diabetic : Factor w/ 3 levels "No","No, borderline diabetes",..: 1 1 1 1 1 1 1 1 3 1 ...
## $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ GenHealth : num 4 4 3 4 3 3 2 3 2 3 ...
## $ SleepTime : num 8 6 8 6 6 8 5 8 6 8 ...
## $ Asthma : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ KidneyDisease : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ SkinCancer : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ Age : num 58 28 82 34 59 66 27 31 61 60 ...
## $ y : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
5.3 Logistic regression model
We split the dataset into two parts to train and evaluate the model later. 80% of the dataset will be used to train the model, and the rest (20%) will be used to test the model’s accuracy. We use createDataPartition from the caret library to split the dataset.
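For readers without caret installed, the stratified 80/20 split can be approximated in base R as sketched below. This mimics the idea of sampling 80% within each class and is not a reimplementation of `createDataPartition`. The toy data and object names are illustrative.

```r
set.seed(2020)
# Toy balanced data standing in for the balanced dataset
toy <- data.frame(
  x = rnorm(200),
  y = factor(rep(c("No", "Yes"), each = 100), levels = c("No", "Yes"))
)

# Sample 80% of the row indices within each class so both sets stay balanced
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$y), function(idx) {
  idx[sample.int(length(idx), round(0.8 * length(idx)))]
}))

data_train <- toy[train_idx, ]
data_test  <- toy[-train_idx, ]

table(data_train$y)
table(data_test$y)
```

Sampling within each class keeps the 50/50 balance in both the training and the test set, which a plain random split would only do approximately.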
After having the data split, the training dataset is used to build the model. First, we use all the variables as independent variables and make a model as below.
##
## Call:
## glm(formula = y ~ ., family = binomial(logit), data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9445 -0.7856 -0.0188 0.8142 2.9983
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -3.383373 0.144227 -23.46
## BMI 0.011842 0.001979 5.98
## SmokingYes 0.390439 0.024282 16.08
## AlcoholDrinkingYes -0.174367 0.053397 -3.27
## StrokeYes 1.191791 0.050745 23.49
## PhysicalHealth 0.003855 0.001549 2.49
## MentalHealth 0.005636 0.001583 3.56
## DiffWalkingYes 0.212683 0.033270 6.39
## SexMale 0.737279 0.024622 29.94
## RaceAsian -0.481915 0.133290 -3.62
## RaceBlack -0.372107 0.101400 -3.67
## RaceHispanic -0.210721 0.102362 -2.06
## RaceOther -0.176508 0.112251 -1.57
## RaceWhite -0.195594 0.091659 -2.13
## DiabeticNo, borderline diabetes 0.193157 0.074438 2.59
## DiabeticYes 0.484186 0.030476 15.89
## PhysicalActivityYes -0.033862 0.028341 -1.19
## GenHealth -0.502894 0.014087 -35.70
## SleepTime -0.030457 0.007679 -3.97
## AsthmaYes 0.299073 0.034472 8.68
## KidneyDiseaseYes 0.663414 0.052078 12.74
## SkinCancerYes 0.145502 0.035878 4.06
## Age 0.058646 0.000947 61.91
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## BMI 0.00000000220 ***
## SmokingYes < 0.0000000000000002 ***
## AlcoholDrinkingYes 0.00109 **
## StrokeYes < 0.0000000000000002 ***
## PhysicalHealth 0.01279 *
## MentalHealth 0.00037 ***
## DiffWalkingYes 0.00000000016 ***
## SexMale < 0.0000000000000002 ***
## RaceAsian 0.00030 ***
## RaceBlack 0.00024 ***
## RaceHispanic 0.03953 *
## RaceOther 0.11585
## RaceWhite 0.03285 *
## DiabeticNo, borderline diabetes 0.00946 **
## DiabeticYes < 0.0000000000000002 ***
## PhysicalActivityYes 0.23216
## GenHealth < 0.0000000000000002 ***
## SleepTime 0.00007306129 ***
## AsthmaYes < 0.0000000000000002 ***
## KidneyDiseaseYes < 0.0000000000000002 ***
## SkinCancerYes 0.00005003125 ***
## Age < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 60714 on 43795 degrees of freedom
## Residual deviance: 43297 on 43773 degrees of freedom
## AIC: 43343
##
## Number of Fisher Scoring iterations: 5
The model shows that PhysicalActivityYes (p = 0.23) and the RaceOther level (p = 0.12) have p-values greater than 0.05, which means they are not significant at that level, although the other Race levels are. We drop Race and PhysicalActivity and fit a second logistic regression model.
We quickly check two things for this model. First, the p-values: a p-value below 0.05 indicates that a coefficient is significantly different from zero. Second, the pseudo R-squared: this value ranges from 0 to 1 and summarizes how much the model improves on a model with no predictors.
All the p-values of the model indicate significance, so every remaining predictor contributes to the model. The pseudo R-squared is 0.29, which is loosely read as 29 percent of the variation being explained.
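One common pseudo R-squared for logistic regression is McFadden's, which compares the residual deviance to the null deviance. Whether the report used McFadden's version is an assumption; the sketch below applies the formula to the deviances printed for the full model above, since the second model's summary is not shown.

```r
# Deviances copied from the full-model summary above
null_dev  <- 60714   # model with intercept only
resid_dev <- 43297   # fitted model

# McFadden's pseudo R-squared: 1 - (residual deviance / null deviance)
mcfadden <- 1 - resid_dev / null_dev
round(mcfadden, 2)
```

For a fitted `glm` object the same quantity can be computed as `1 - fit$deviance / fit$null.deviance`, where `fit` stands for whatever name the model was stored under.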
After we finish this, we can have a look at the Variance Inflation Factor (vif).
- When 1 < vif < 5, it means the variables are mildly correlated. It’s acceptable.
- When 5 < vif < 10, it means moderately correlated, and it also can be acceptable.
- When vif > 10, it’s not acceptable.
## BMI SmokingYes
## 7.12 6.38
## AlcoholDrinkingYes StrokeYes
## 6.52 9.62
## PhysicalHealth MentalHealth
## 10.43 8.06
## DiffWalkingYes SexMale
## 8.71 6.58
## DiabeticNo, borderline diabetes DiabeticYes
## 5.59 7.09
## GenHealth SleepTime
## 11.09 6.68
## AsthmaYes KidneyDiseaseYes
## 6.80 8.44
## SkinCancerYes Age
## 6.40 11.15
Some VIF values are larger than 10, which means these variables are highly correlated with the others and are not acceptable. So we dropped one variable at a time while checking the p-values to ensure that the remaining variables stayed significant. In the end, we got the model below:
##
## Call:
## glm(formula = y ~ . - Race - PhysicalActivity - Age - Asthma -
## PhysicalHealth, family = binomial(logit), data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.370 -0.858 -0.120 0.941 2.297
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.21961 0.08471 2.59
## BMI -0.00625 0.00181 -3.45
## SmokingYes 0.48167 0.02257 21.34
## AlcoholDrinkingYes -0.40284 0.04980 -8.09
## StrokeYes 1.36603 0.04933 27.69
## MentalHealth -0.01653 0.00141 -11.74
## DiffWalkingYes 0.62529 0.03037 20.59
## SexMale 0.59018 0.02277 25.91
## DiabeticNo, borderline diabetes 0.42327 0.07167 5.91
## DiabeticYes 0.71741 0.02907 24.68
## GenHealth -0.57790 0.01223 -47.24
## SleepTime 0.03334 0.00728 4.58
## KidneyDiseaseYes 0.85136 0.05099 16.70
## SkinCancerYes 0.74038 0.03395 21.81
## Pr(>|z|)
## (Intercept) 0.00953 **
## BMI 0.00055 ***
## SmokingYes < 0.0000000000000002 ***
## AlcoholDrinkingYes 0.0000000000000006 ***
## StrokeYes < 0.0000000000000002 ***
## MentalHealth < 0.0000000000000002 ***
## DiffWalkingYes < 0.0000000000000002 ***
## SexMale < 0.0000000000000002 ***
## DiabeticNo, borderline diabetes 0.0000000035125791 ***
## DiabeticYes < 0.0000000000000002 ***
## GenHealth < 0.0000000000000002 ***
## SleepTime 0.0000047081971992 ***
## KidneyDiseaseYes < 0.0000000000000002 ***
## SkinCancerYes < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 60714 on 43795 degrees of freedom
## Residual deviance: 48335 on 43782 degrees of freedom
## AIC: 48363
##
## Number of Fisher Scoring iterations: 4
We also checked the VIF again. All the VIF values are now below 10.
## BMI SmokingYes
## 6.03 5.58
## AlcoholDrinkingYes StrokeYes
## 5.68 9.12
## MentalHealth DiffWalkingYes
## 6.39 7.44
## SexMale DiabeticNo, borderline diabetes
## 5.66 5.21
## DiabeticYes GenHealth
## 6.51 8.53
## SleepTime KidneyDiseaseYes
## 6.02 8.09
## SkinCancerYes
## 5.81
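For reference, the textbook VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on synthetic numeric predictors. It illustrates the definition only and is not necessarily the function that produced the values above.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # built to be strongly related to x1
x3 <- rnorm(n)                        # unrelated to the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

In this toy example x1 and x2 get large VIFs because each is nearly a linear function of the other, while x3 stays close to 1.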
5.4 Feature selection
In this part, we use feature selection to find the most suitable variables from our current model. Unfortunately, the training dataset has more than 40,000 rows, which is large enough that the search takes a long time to run. So we used the test dataset for feature selection instead. The test dataset has the same data structure but fewer rows.
Although lacking intuitive visual presentation of results, bestglm::bestglm() can handle logistic regression. Thus, we used it to do the feature selection.
## Fitting algorithm: AIC-glm
## Best Model:
## df deviance
## Null Model 10936 12036
## Full Model 10949 15180
##
## likelihood-ratio test - GLM
##
## data: H0: Null Model vs. H1: Best Fit AIC-glm
## X = 3144, df = 13, p-value <0.0000000000000002
## BMI Smoking AlcoholDrinking Stroke MentalHealth DiffWalking Sex Diabetic
## 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 3 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 4 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 5 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## GenHealth SleepTime KidneyDisease SkinCancer Criterion
## 1 TRUE TRUE TRUE TRUE 12062
## 2 TRUE FALSE TRUE TRUE 12064
## 3 TRUE TRUE TRUE TRUE 12067
## 4 TRUE FALSE TRUE TRUE 12070
## 5 TRUE TRUE TRUE TRUE 12070
## BMI Smoking AlcoholDrinking Stroke MentalHealth
## Mode :logical Mode:logical Mode :logical Mode:logical Mode:logical
## FALSE:1 TRUE:5 FALSE:2 TRUE:5 TRUE:5
## TRUE :4 TRUE :3
##
##
##
## DiffWalking Sex Diabetic GenHealth SleepTime
## Mode:logical Mode:logical Mode:logical Mode:logical Mode :logical
## TRUE:5 TRUE:5 TRUE:5 TRUE:5 FALSE:2
## TRUE :3
##
##
##
## KidneyDisease SkinCancer Criterion
## Mode:logical Mode:logical Min. :12062
## TRUE:5 TRUE:5 1st Qu.:12064
## Median :12067
## Mean :12067
## 3rd Qu.:12070
## Max. :12070
The feature selection shows that the best model keeps all 12 candidate variables (13 coefficients, since Diabetic contributes two), and its AIC (Akaike Information Criterion) is 12062, which is the lowest among these models.
5.5 Model Evaluation
In this part, we will use AUC and confusion matrix to evaluate the model.
5.5.1 ROC and AUC
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), and the Area Under the Curve (AUC) summarizes it in a single number. The AUC lies between 0 and 1, where 0.5 corresponds to random guessing and 1 to perfect separation. Values higher than 0.8 are considered a good model fit.
The AUC of the model is 0.795, which is slightly below 0.8. Because the model looks suitable and includes all the features we need, we suspect that the data, rather than the model specification, limits the AUC.
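The AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R; the function name `auc_rank` and the toy scores are illustrative, and this is not the package call used for the report.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic scaled to [0, 1])
auc_rank <- function(score, label) {
  pos   <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)   # average ranks, so ties count as half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predicted probabilities and true labels (1 = heart disease)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,    1,   0,   0)

auc_toy <- auc_rank(score, label)
auc_toy
```

Here 13 of the 16 positive-negative pairs are ordered correctly, so the toy AUC is 13/16. A model that ranked every pair wrongly would score 0, which is why the range is 0 to 1 rather than 0.5 to 1.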
5.5.2 Confusion matrix
We can then have a look at the Confusion matrix.
| | Predicted No | Predicted Yes | Total |
|---|---|---|---|
| Actual No | 16646 | 5252 | 21898 |
| Actual Yes | 6957 | 14941 | 21898 |
| Total | 23603 | 20193 | 43796 |
From the confusion matrix, the precision is 14941/(5252+14941) = 0.74, which means that 74% of the predicted heart disease cases are actual cases. The recall is 14941/(6957+14941) = 0.68, which means that the model finds 68% of the actual heart disease cases.
For our model, we consider recall more important, because a false negative (FN) is a heart disease patient missed by the model, which can have harmful consequences.
## No Yes
## 0 4220 1753
## 1 1255 3722
Then we use the test dataset to check whether the model is good to use. We used data_test to make predictions and computed the confusion matrix for the test dataset. The precision is 3722/(1255+3722) = 0.75 and the recall is 3722/(1753+3722) = 0.68.
The precision and recall on the test dataset are similar to those on the training dataset, which suggests that the model is reliable for predicting heart disease.
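The precision and recall arithmetic can be checked directly from a confusion matrix. The sketch below uses the training counts reported above, with rows as actual classes and columns as predicted classes. The object names are illustrative.

```r
# Training confusion matrix from above: rows = actual, columns = predicted
cm <- matrix(c(16646,  5252,
                6957, 14941),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("No", "Yes"),
                             Predicted = c("No", "Yes")))

tp <- cm["Yes", "Yes"]   # heart disease correctly predicted
fp <- cm["No",  "Yes"]   # predicted heart disease, actually none
fn <- cm["Yes", "No"]    # heart disease missed by the model

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
accuracy  <- sum(diag(cm)) / sum(cm)

round(c(precision = precision, recall = recall, accuracy = accuracy), 2)
```

Since recall is the priority here, one option would be to lower the classification threshold below 0.5, which trades some precision for fewer missed patients.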
5.6 Interpretation and Reporting
We now return to our logistic regression model and look at the estimated parameters (coefficients). Since the model’s parameters are recorded on the logit (log-odds) scale, we transformed them into odds ratios so that they are easier to interpret. After transforming, we sorted the terms by their odds ratios.
## # A tibble: 14 × 3
## term estimate statistic
## <chr> <dbl> <dbl>
## 1 StrokeYes 3.92 27.7
## 2 KidneyDiseaseYes 2.34 16.7
## 3 SkinCancerYes 2.10 21.8
## 4 DiabeticYes 2.05 24.7
## 5 DiffWalkingYes 1.87 20.6
## 6 SexMale 1.80 25.9
## 7 SmokingYes 1.62 21.3
## 8 DiabeticNo, borderline diabetes 1.53 5.91
## 9 (Intercept) 1.25 2.59
## 10 SleepTime 1.03 4.58
## 11 BMI 0.994 -3.45
## 12 MentalHealth 0.984 -11.7
## 13 AlcoholDrinkingYes 0.668 -8.09
## 14 GenHealth 0.561 -47.2
The table shows that other diseases (stroke, kidney disease, diabetes, skin cancer), general health condition, sex, DiffWalking (serious difficulty walking or climbing stairs), and smoking habit all strongly influence the odds of heart disease. It is surprising that heavy alcohol drinking is associated with lower odds of heart disease (odds ratio 0.668), since drinking is generally considered an unhealthy habit. One possible explanation is that the data also include people who drink only a little wine.
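The odds ratios in the table are the exponentials of the logit coefficients of the final model. The sketch below repeats the transformation for a few of the coefficients printed in Section 5.3, using illustrative object names.

```r
# A few logit coefficients copied from the final model summary in Section 5.3
log_odds <- c(StrokeYes          =  1.36603,
              KidneyDiseaseYes   =  0.85136,
              SmokingYes         =  0.48167,
              AlcoholDrinkingYes = -0.40284,
              GenHealth          = -0.57790)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratio <- sort(exp(log_odds), decreasing = TRUE)
round(odds_ratio, 3)
```

An odds ratio above 1 means the odds of heart disease rise with that term, and a value below 1 means they fall. For GenHealth, which is coded numerically, the ratio applies per one-step improvement in reported health.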
6 Classification Trees
6.1 First Classification Tree
For the purposes of the decision tree, the mental health variable was recoded into a categorical variable with the values “Yes” and “No”. Its original values range from 0 to 30 and record on how many days of the previous month the people interviewed felt their mental health was not good. An if-else rule assigns “Yes” if a person’s mental health was not good for 15 or more days, and “No” otherwise (MentalHealth>=15, “No”, “Yes”).
This recoding was needed to build the decision tree with the relevant variables. In this first decision tree the following variables were selected: HighMH, Smoking, Sex, AlcoholDrinking, and PhysicalActivity.
After sub setting the data a training model was created to predict the class or value of the target variable, which in this case is Smoking, by learning simple decision rules inferred from this data training.
With the training dataset created, a tree was built responding to smoking as the target value. The results show 13 different nodes for each variable. The actual tree starts with the root node labeled 1). observations and a default decision of No. There are 107000 observations with Yes as the decision, so these are lost if we make the decision No for all observations. The probability of No is reported as 0.58 and of Yes us 0.41. The root node is split into two branches, nodes number 2 and 4. For node number 2, the split corresponds to those observations for which AlcoholDrinking is equal to No. This accounts for 238097 observations and whilst 96200 of them are Yes. The majority (with a proportion of 0.596) are No. Going forward with interpreting the nodes, it is concluded that with a proportion of 60% people who drinks alcohol, and out of this 42% are male and practice physical activity, 41% percent reported more than 15 days in the prior month where their mental health seemed to be affected.
Graphically, the tree looks like the following plot. It also shows that, starting at node 5, which represents the 51% of people who did not exercise at all during the month, 45% are female, of whom 42% felt their mental health was not good on 15 or more of the past 30 days. The following plot is more visually appealing and summarizes the conclusions described above.
Additionally, the fitted tree was used to obtain the prediction for specific groups of observations, together with the rule used to make each prediction for the target variable. The results are the following:
## Smoking
## 0.34 when AlcoholDrinking is No & PhysicalActivity is Yes & Sex is Female
## 0.41 when AlcoholDrinking is No & PhysicalActivity is Yes & Sex is Male & HighMH is Yes
## 0.43 when AlcoholDrinking is No & PhysicalActivity is No & Sex is Female & HighMH is Yes
## 0.52 when AlcoholDrinking is No & PhysicalActivity is Yes & Sex is Male & HighMH is No
## 0.55 when AlcoholDrinking is No & PhysicalActivity is No & Sex is Male
## 0.57 when AlcoholDrinking is No & PhysicalActivity is No & Sex is Female & HighMH is No
## 0.62 when AlcoholDrinking is Yes
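One way to read these rules is as a sequence of nested conditions. The sketch below transcribes the seven printed rules into a Python function so that the value for any combination of answers can be looked up; the function name and argument names are illustrative assumptions, and the returned numbers are simply the values printed above.

```python
def smoking_rule_value(alcohol_drinking, physical_activity, sex, high_mh):
    """Return the value printed for the rule that matches the given answers."""
    if alcohol_drinking == "Yes":
        return 0.62
    if physical_activity == "Yes":
        if sex == "Female":
            return 0.34
        # Male and physically active: the rule depends on HighMH.
        return 0.41 if high_mh == "Yes" else 0.52
    # No alcohol drinking and no physical activity.
    if sex == "Male":
        return 0.55
    return 0.43 if high_mh == "Yes" else 0.57


# Example lookup: a physically active man who does not drink, with HighMH = "No".
example = smoking_rule_value("No", "Yes", "Male", "No")
print(example)
```

Each path through the conditions corresponds to one terminal node of the tree, which is why the seven rules cover every possible combination of the four variables.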
6.2 Second Classification Tree
A second classification tree was built to understand how a different set of variables interacts with the target variable, Smoking. The variables used to construct the model were: Age, Sleep Time, Race, Heart Disease, and Physical Activity.
For this classification tree, the Age variable was recoded with ‘ifelse’ logic to distinguish people aged 30 or older; the new variable was named Age30Plus.
Similarly, the sleep variable was recoded with ‘ifelse’ logic to distinguish people sleeping 7 or more hours; the new variable was named AvSleep.
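These two recodings might look like the sketch below. It assumes that AgeCategory is stored as text labels such as "25-29" or "80 or older" and that “Yes” marks the 30+ and 7+ groups; both points, along with the function names, are assumptions made for illustration rather than details stated in the report.

```python
def age_30_plus(age_category):
    """Return "Yes" if the lower bound of an age label like "25-29" is 30 or more."""
    # Take the text before any hyphen, then its first word, e.g. "80 or older" -> "80".
    lower_bound = int(age_category.split("-")[0].split()[0])
    return "Yes" if lower_bound >= 30 else "No"


def av_sleep(sleep_hours):
    """Return "Yes" for 7 or more hours of sleep, "No" otherwise."""
    return "Yes" if sleep_hours >= 7 else "No"


# Hypothetical example values, not taken from the dataset.
age_flags = [age_30_plus(label) for label in ["18-24", "25-29", "30-34", "80 or older"]]
sleep_flags = [av_sleep(hours) for hours in [5, 7, 9]]
print(age_flags, sleep_flags)
```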
After subsetting the data, a training set was created and used to fit a model that predicts the class of the target variable, in this case Smoking, by learning simple decision rules from the training data.
With the training dataset created, a tree was built with Smoking as the target variable. The resulting tree has 9 nodes, starting with the root node, labeled 1), which contains all observations and has a default decision of No, accounting for 58% of the data. This node splits into those who are 30 or older (22%) and those who are younger than 30 (44%). Further down, node 6 shows that 57% do not have heart disease. It leads to node 12, where 40% perform some sort of physical activity, and to node 13, where 49% do not perform any physical activity at all. Finally, nodes 26 and 27 correspond to race: node 26 indicates that 38% are Asian, Black, or Hispanic, and node 27 indicates that 54% are American Indian/Alaskan Native, White, or Other.
Graphically, the tree looks like the following plot. It also shows, surprisingly, that below node 3 (those under 30 years old), the predicted value for Smoking among people with heart disease is 59%, regardless of the other variables.
The following plot is more visually appealing and summarizes the conclusions described above.
Additionally, the fitted tree was used to obtain the prediction for specific groups of observations, together with the rule used to make each prediction for the target variable. The results are the following:
## Smoking
## 0.22 when Age30Plus is Yes
## 0.39 when Age30Plus is No & HeartDisease is No & PhysicalActivity is No & Race is Asian or Black or Hispanic
## 0.41 when Age30Plus is No & HeartDisease is No & PhysicalActivity is Yes
## 0.54 when Age30Plus is No & HeartDisease is No & PhysicalActivity is No & Race is American Indian/Alaskan Native or Other or White
## 0.59 when Age30Plus is No & HeartDisease is Yes
7 Conclusion
According to the CDC, heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States (CDC, 2022). After performing the EDA, distributions, tests, regression models, and decision trees, we concluded that smoking, strokes, asthma, difficulty walking, kidney disease, diabetes, and skin cancer affect instances of heart disease. In addition to this analysis, we wanted to explore how these factors relate to mental health and physical health. For this analysis, we arrived at the following conclusions. First, a cutoff value of 0.15 is suitable for our logistic regression model, and through this model we found that general health condition, gender, other diseases, and smoking habits largely influence the likelihood of heart disease. Second, from the classification tree, not practicing any physical activity affects whether people report that their mental health was not good (that is, not good for 15 or more days in the past month). Finally, from the classification tree, the probability of smoking for people 30 or older, whether or not they have heart disease, is more than 50%. Surprisingly, drinking alcohol does not have a great impact on any of the target variables we examined (heart disease, mental health, and physical health), whereas smoking has an impact on all three. Similarly, BMI as a single measure did not have any effect on any of the target variables. Studies have shown that BMI would not be expected to identify cardiovascular health or illness overall (Shmerling, 2020). The same findings explain that body composition, including percent body fat or amount of muscle mass, can vary by race and ethnic group, and thus BMI alone will not reliably predict current health status.
8 References
Centers for Disease Control and Prevention. (2022). Heart Disease Facts. https://www.cdc.gov/heartdisease/facts.htm
Kaggle. (2020). Heart Disease Dataset. https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
Shmerling, R. (2020). How useful is the body mass index (BMI)? Harvard Health Publishing. https://www.health.harvard.edu/blog/how-useful-is-the-body-mass-index-bmi-201603309339
Walton, A. (2017). The 5 Key Habits For Long-Term Health, According To Science. Forbes. https://www.forbes.com/sites/alicegwalton/2017/07/27/the-5-habits-that-really-define-longterm-health-according-to-science/?sh=6833625f4286